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Abstract 

Setting: During endoplasmic reticulum (ER) stress, the endoribonuclease (RNase) Irela initiates removal of a 26 nt region 
from the mRNA encoding the transcription factor Xbpl via an unconventional mechanism (atypically within the cytosol). 
This causes an open reading frame-shift that leads to altered transcriptional regulation of numerous downstream genes in 
response to ER stress as part of the unfolded protein response (UPR). Strikingly, other examples of targeted, unconventional 
splicing of short mRNA regions have yet to be reported. 

Objective: Our goal was to develop an approach to identify non-canonical, possibly very short, splicing regions using RNA- 
Seq data and apply it to ER stress-induced Irelu heterozygous and knockout mouse embryonic fibroblast (MEF) cell lines to 
identify additional Irela. targets. 

Results: We developed a bioinformatics approach called the Read-Split-Walk (RSW) pipeline, and evaluated it using two 
Irela heterozygous and two /re7a-null samples. The 26 nt non-canonical splice site in Xbpl was detected as the top hit by 
our RSW pipeline in heterozygous samples but not in the negative control Ire la knockout samples. We compared the Xbpl 
results from our approach with results using the alignment program BWA, Bowtie2, STAR, Exonerate and the Unix "grep" 
command. We then applied our RSW pipeline to RNA-Seq data from the SKBR3 human breast cancer cell line. RSW reported 
a large number of non-canonical spliced regions for 108 genes in chromosome 17, which were identified by an independent 
study. 

Conclusions: \N& conclude that our RSW pipeline is a practical approach for identifying non-canonical splice junction sites 
on a genome-wide level. We demonstrate that our pipeline can detect novel splice sites in RNA-Seq data generated under 
similar conditions for multiple species, in our case mouse and human. 
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Introduction 

The endoplasmic reticulum (ER) is the cellular organelle 
responsible for protein folding and assembly of approximately 
one-third of all eukaryotic cellular proteins [1]. The Unfolded 
Protein Response (UPR) is an adaptive intracellular stress response 
activated by the accumulation of unfolded or misfolded proteins 
within the ER lumen [2] and several reports indicate that ER 
stress is associated with diseases such as type II diabetes (T2D), 
metabolic disease, obesity, inflammation, neurodegeneration and 
cancer [3-6], Upon activation of the UPR, the protein Irela. 
(Inositol Requiring Enzyme 1 oc); also known as ERJV1 (endoplasmic 



reticulum to nucleus signaling 1), initiates the cytosolic (non- 
canonical) splicing of Xbpl (X-box binding protein 1) mRNA to 
remove a 26 nucleotide (nt) sequence that results in a shift of the 
translational-open reading frame [7]. Spliced Xbpl mRNA 
encodes a potent transcriptional activator of hundreds of UPR 
genes encoding functions that facilitate protein folding, secretion, 
and degradation in response to ER stress. Figure 1 shows the 26 nt 
sequence (highlighted in gray) - chrl 1:5424282-5424307 (refer- 
ence genome mm9) for the National Center for Biotechnology 
Information (NCBI) RNA reference sequence (RefSeq) 
No. NM_0 13842.3 spliced from the Xbpl gene. 
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5424056 



AAAGCCCGGATGAGCGAGCTGG 
AGCAGCAAGTGGTGGATTTGGA 
AGAAGAGAACCACAAACTCCAG 
CTAGAAAATCAGCTTTTACGGGA 
GAAAACTCACGGCCTTGTGGTTG 
AGAACCAGGAGTTAAGAACACG 
CTTGGGAATGGACACGCTGGATC 
CTGACGAGGTTCCAGAGGTGGA 
GGCCAAGGGGAGTGGAGTAAGG 
CTGGTGGCCGGGTCTGCTGAGTC 
CGCAGCACTCAGACTATGTGCA 
CCTCTGCAGCAGGTGCAGGCCC 
AGTTGTCACCTCCCCAGAACATC 
TTCCCATGGACTCTGACAC 



Splice out 26 bp 



aaagcccggatgagcgagctgg 
agcagcaagtggtggatttggaa 
gaagagaaccacaaactccagc 
tagaaaatcagcttttacgggag 
aaaactcacggccttgtggttga 
gaaccaggagttaagaacacgc 
ttgggaatggacacgctggatcc 
tgacgaggttccagaggtggag 
gccaaggggagtggagtaaggc 
tggtggccgggtctgctgagtcc 
zagcactcagactatgtgcac 
icagcaggtgcaggccca 
gtt(vcv:ctccccagaacatct 
tcccaVgcVctctgacac 




Figure 1. XbpVs partial transcript sequence, including the non-canonical 26 nt spliced region (highlighted on right). 

doi:1 0.1 371 /journal.pone.01 00864.g001 



Although the splicing of nitrons is inherently stochastic in terms 
of the mRNA sequence(s) required to define the 3' and 5' intron 
boundaries, there does exist a splicing code that predicts these 
boundary locations. Most splicing events fall into one of two highly 
conserved splicing categories: U2- and U12-dependent. Because 
U2 represents ~99% of human introns, U12 intron splicing is 
commonly referred to as non-canonical splicing [8] . However, U2- 
and U12- dependent splicing events occur within nuclear pre- 
mRNAs whereas IRE 1 oc-dependent Xbpl mRNA splicing occurs 
within the cytosol and has its own sequence recognition 
requirements [9-10], thus it is not only non-canonical in the 
sense of not being U2-dependent; it is uncharacteristic of any 
classification. 

In recent years, new sequencing technologies for expressed 
RNA (RNA-Seq [1 1]) have improved expression profiling. 
Advantages of RNA-Seq compared to microarrays include the 
ability to detect the entire transcriptome and increased sensitivity 
for detecting splice variants, splice junctions, and novel transcripts. 
RNA-Seq can be performed on any of several technology 
platforms; our report focuses on data from the Illumina Genome 
Analyzer, however, our approach is applicable to other platforms. 
After the expressed reads are sequenced, they can be aligned 
against the reference genome plus junctions between known exon 
boundaries to include reads that cross splice junction sites. 
However, this approach excludes reads that cross non-canonical 
or unknown splice sites, which are interpreted as having "gaps" in 
them. We hypothesized that this set of unmappable reads could 



thus contain non-canonical and novel sequences, such as the 
region surrounding the 26 nt intron in Xbpl. Since the intron 
spliced from Xbpl mRNA is a non-canonical splice event, it serves 
as an ideal positive control around which to design our RSW 
algorithm. Irela. null mouse embryonic fibroblast (MEF) cells 
provided a robust negative control, as they cannot produce the 
non-canonical splice in Xbpl mRNA. 

Several methods exist to identify novel splice sites in RNA-Seq 
data, including Alt Event Finder [12], TrueSight [13], Splicing- 
Compass [14], and PASTA [15]. Alt Event Finder [12] processes 
the mapped reads to report splice regions yet does not consider 
unmapped reads in the analysis input. Although TrueSight [13] 
utilizes all identified junctions and builds a regression model to 
report the best alignment, the algorithm is not suitable for our 
Xbpl-like non-canonical spliced region detection study due to the 
lack of pre-known Xbpl-like spliced regions for model training. 
SplicingCompass [14] maps reads to the reference genome initially 
and also does not utilize unmapped reads. It then predicts genes 
that undergo differential splicing based on the expression level, 
which is not applicable for identifying Xbpl-like spliced regions. 
PASTA [15] uses a combination of heuristic and statistical 
methods to identify canonical exon-intron junctions with high 
accuracy, especially at low coverage levels. PASTA [15] also uses 
the unaligned reads to infer the exact location of junction 
boundaries. Tophat also focuses on canonical exon-exon splice 
sites as well. In summary, none of these tools were designed for 
detecting the type of non-canonical, short splicing pattern 
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observed in IRE 1 a-dependent Xbpl mRNA splicing. Many of 
these tools (TrueSight [13], SplicingCompass [14], PASTA [15]) 
also make use of paired-end sequencing data. However, in the case 
of Xbpl and similar cases, paired-end sequencing data would not 
be beneficial since the 26 nt length of the spliced region is within 
the error limits of the estimated insert length (i.e., a difference of 
26 bp would be too small to reliably detect as significant). Thus, 
our method makes use of single-end sequencing reads. 

With this motivation in mind, we developed an algorithm and 
pipeline called the Read-Split- Walk (RSW), to specifically process 
unmapped reads from RNA-Seq data to identify non-canonical, 
possibly very short, spliced regions; we then applied this pipeline to 
identify genome-wide Zreia-dependent alternative spliced regions. 
The successful identification of the 26 nt Xbpl non-canonical 
splice site as the top hit in two separate experiments validates our 
algorithm and suggests no other mRNA splicing event dependent 
on IRE la is as strong as Xbpl's site under the contexts we 
examined. Although we only examined one time point, it was 
during the peak of IRE la-dependent Xbpl mRNA splicing; thus, 
the lack of additional non-Xbpl mRNA splice targets is consistent 
with what was reported in yeast, where HAC 1 was found to be the 
sole splicing substrate of IREla [16]. We compared our pipeline's 
results with related tools that might be used to detect novel non- 
canonical, short spliced regions such as BWA [17], Bowtie2 [18], 
and STAR [19]. In addition, we used known pattern detection 
(grep), and homology search approaches (Exonerate [20]) to 
estimate the sensitivity and precision of our RSW pipeline. 

Materials and Methods 

Origin of cell lines 

Two mouse embryonic fibroblast (MEF) cell lines were 
developed for testing for IRE la-dependent alternative spliced 
regions. In the first experiment using 35 bp reads, MEFs were 
prepared from day 9 embryos of Black 6 (B6) mice of greater than 
10 generations back-crossed and of cell passage 6. IRE la- 
dependent Xbpl mRNA splicing was knocked out using a floxed 
allele that removes the kinase domain and subsequently the RNase 
activity [21]. For the second experiment using 79 bp reads, MEFs 
were prepared from mice of the same B6 background except with 
a different null Irelcc knockout allele that results in a more 
complete gene deletion that removes both the kinase and RNase 
domains [22]. 

The human breast cancer cell line, SKBR3 was obtained from 
the American Type Culture Collection (Manassas, VA). The cells 
were maintained in culture with DMEM/F-12 medium supple- 
mented with 10% FBS (Tissue Culture Biologicals, Seal Beach, 
CA) and 1% of Antibiotic- Antimycotic 100X (Gibco, Carlsbad, 
CA) [23] . The SKBR3 cell line is a model for the aggressive Her2/ 
neu-amplified subtype of human breast cancers. 

Experimental design to detect Irelot splicing targets 

Mouse embryonic fibroblasts (MEFs) heterozygous (Het) Irela +/ ~ 
versus knockout (KO) Ireln 1 were treated for 4 hours with 
either 500 nM Thapsigargin (Tg) or 1 mM Dithiothreitol (Dtt) 
to induce the unfolded protein response (UPR). During these 
stresses, the Het cells are able to perform the non-canonical 
(26 nt) splicing of Xbpl mRNA, whereas Irela. 1 cells do not. 
Poly-A mRNA was isolated from these cells, sheared to 
approximately 250 nucleotides, randomly primed to form 
cDNA, and the Illumina linkers were added to the cDNA 
fragment library. Illumina sequencing (GAII) was performed on 
each sample. The Tg reads were 35 nucleotide single-end reads; 
the Dtt reads were 79 nucleotide single-end reads. The 



experiments with the 2 different treatments (Tg and Dtt) were 
conducted several months apart and changes in the Illumina 
platform resulted in differential read lengths and numbers of 
reads, allowing us to test RSW and compare the robustness of 
results under two independent, different sequencing approaches. 
These data have been deposited in NCBI's Gene Expression 
Omnibus (GEO) and are accessible through GEO Series 
accession number GSE54631. 

Initial read processing and reference genome selection 

We used the Illumina (www.illumina.com) Genome Analyzer 
Pipeline Software package (GAII) to perform a complete data 
analysis of the four RNA-Seq datasets. Specifically, we employed 
the GAPipeline and the Off- Line Base caller (OLB) pipeline v 1.6.1 
suites of modules to perform image analysis (Firecrest module) and 
base calling (Bustard module) for the Genome Analyzer. We then 
collected the original read sequence files in fastq format and 
downloaded the mouse genome reference sequences (mm9) from 
the University of California Santa Cruz (UCSC) genome browser 
(http://genome.ucsc.edu). We also downloaded UCSC genes 
(knownGene.txt) from UCSC genome browser. The splice 
junction file was created by setting the sequence entry on each 
side of the junction site to 4 bp shorter than the read length (35 or 
79 bp) using a RNA-Seq software python script (getsplicefa.py) 
from ERANGE version 3.1 (http://woldlab.caltech.edu/~alim/ 
RNA-seq/). The original reference genome and splice junction site 
file were merged together to form an expanded genome. 

Identification of novel spliced regions using the RSW 
algorithm 

Align read sequences to the mouse reference genome and 
split unmapped reads into halves. We used the Bowtie [24] 
aligner (version 0.12.7) to align original sequences against the 
expanded genome (mouse reference genome mm9 and known 
splice junction boundaries). We specified a maximum of 2 
mismatches (bowtie parameter -v 2) and instructed Bowtie to 
report up to 11 valid alignments (-k 11). The aligner wrote reads 
with more than 1 0 reportable alignments (mapping locations) into 
a separate file (-m 10, — max). Since the goal of this project was to 
identify special non-canonical spliced regions, the "best" align- 
ment option ( — best) was also used. Likewise, we only processed 
original reads deposited in the unmapped read data set ( — un 
option). The unmapped reads are used in the remainder of this 
report. 

Based on the overall average quality scores across the read 
length, we trimmed the last 2 bases of each read. Each unmapped 
read was used to create a small library of split pairs. The split pairs 
were generated by splitting the read at consecutive nucleotide 
positions, such that each end (left or right) of the read had a 
minimum of 1 1 and a maximum of 22 nucleotides if the total read 
length was 33 nucleotides (Tg treated samples) or a minimum of 
1 1 and a maximum of 66 nucleotides if the total read length was 
77 nucleotides (Dtt treated samples). Thus the sequential divisions 
between the left and right end across the length of the read 
produces a library of sequence pairs all coming from a single 
previously unmapped read; the resulting halves then undergo a 
new round of mapping that also keeps track of the gap/intron 
between halves (Figure 2). The RSW algorithm was performed on 
all unmapped reads from all samples. 

Select split read halves mapped to genome and selection 
criteria for novel splice locations. We define a split read pair to 
be two sequences that were originally from the same read. Each 
split read half was then aligned back to the original reference 
genome and exon boundaries using Bowtie [24] under the same 
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Figure 2. Illustration of a hypothetical potentially novel spliced region with two supporting reads. Below the gene are two examples 
that RSW would disregard: a split read pair mapped to the incorrect strand (gray) and a split read pair mapped with a 0 or 1 bp gap (orange). 
doi:1 0.1 371 /journal.pone.01 00864.g002 



parameter settings except that no mismatch (option -v 0) was 
allowed for this alignment step. Following alignment, split read 
pairs were selected for follow-up analysis if they met certain 
criteria. We required that two split read ends map to within 40 kb 
of each other and map to the same strand and same chromosome, 
together with the same matched gene. We did not consider reads 
that were inversely mapped (right end mapped before left end on 
the + strand, or left before right on the - strand). We also filtered 
out pairs that were adjacent (spliced region length <2 bp), since 
this could occur based on sequence homology on either side of the 
split, and thus were likely false positives. The spliced region was 
defined as the interval between the nearest ends of the two split 
halves from the same read. The splice length was calculated as the 
distance between the alignment end position of the read split end 
having lower mapped genomic coordinates and the alignment start 
location of the read split end having higher mapped genomic 
coordinates. 

All the selected split read ends for each read were consolidated 
into a unique spliced region if the genomic boundaries for their splice 
length were within a certain threshold (1 1 bp for Tg treated 
samples and 27 bp for Dtt treated samples) and their splice length 
was the same. These equivalendy mapped reads were then 
consolidated into unique spliced regions. The final candidate 
spliced region for a gene was calculated as the range between the 
smallest lower boundary and the largest upper boundary among 
the regions of supportive reads where the splice length for every 
mapped read was the same (see Figure 2). 

If a spliced region was within 5 bp of a known junction in the 
UCSC knownGene reference database or it was not supported by 
at least 2 independent reads, then it was not considered a novel 
candidate. Otherwise, the region was reported as a novel, 
candidate non-canonical region. We note that we report 
candidates supported by as low as 2 independent reads for 
completeness, so that users can observe the distribution of 
candidates by number of supported reads. These distributions in 
our data suggest that those supported by only 2-A reads are likely 
false positives. Figure 2 illustrates the criteria we used for our 
algorithm for Tg treated samples. 

To compare splice locations between the Het and KO cells, we 
allowed a 5 bp variance at either end of the splice. Clusters with 



multiple supporting reads in the Het sample but having no reads 
or very short splice length (<2 bp) in the KO sample were 
considered candidates. The results from both runs were combined 
and manually inspected for regions present across both treatments 
(Tg and Dtt). The step of selecting candidate split read pairs was 
performed in Oracle using SQL and the clustering algorithm was 
implemented in Perl. Scripts are available at http://sartorlab. 
ccmb.med.umich.edu/ rsw. 

Assessing the robustness of our choice of parameters. To 
test the robustness of our approach, we also assessed the effect of (1) 
changing the minimum split length (from 1 1 to 1 3), (2) changing the 
alignment parameter for mapping the split reads to allow 1 
mismatch (original analysis allowed 0 mismatches), and (3) changing 
the minimum number of reads supporting a spliced region (from 2 
to 5) for the Tg experiment. We did not assess the effect of changing 
the requirements that reads to map to the same strand, the same 
chromosome, and the same matched gene since these are critical 
requirements for our hypothesis. 

Computational requirements for RSW. Computational 
requirements include Perl (v5.7 or later) and pre-installment of 
the alignment software bowtie (v0.12.7 or later) [24] in a location 
accessible in the path. In addition, the desired reference genome 
and relevant bowtie index files should be built. The UCSC 
refFlat.txt annotation file for the desired species should be in the 
same directory. At least 1 GB of RAM is recommended. 

BWA aligner to identify novel splices or gaps 

We compared our algorithm to BWA [17] software, which uses 
gapped alignment, to test whether the known Xbpl 26 nt spliced 
region could be identified using BWA. We modified the BWA 
aligning parameters to allow for larger gaps, adjusting for the 
number of gap extensions (length of gap (-e from 26 to 100 
(maximum number of gap extensions), and altering the gap open 
penalty (-O 15-20) and the gap extension penalty (-E 0-4). 
Reported results are based on the BWA backtrack algorithm 
(Version 0.6.2-rl26) "ain" command with -I (the input was in the 
Illumina 1.3+ FASTQ;like format), -O (gap open penalty was set 
to 15), -E (gap extension penalty was set to 0), and -e (the 
maximum number of gap extensions was set to 100). The BWA 
algorithm with the "samse" module was used to generate 
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alignments. Results generated using the other above-mentioned 
parameters were the same. 

Bowtie2 aligner to identify novel splices or gaps 

We additionally used bowtie2 [18], which as opposed to the 
original Bowtie allows gapped alignment of short reads. We used 
different combinations of gap open penalty (5, 7, 10, 15, 20, and 
30) and gap extension penalty (0, 1, and 3) on Tg and Dtt data sets 
and compared the results with those from RSW. 

Running Exonerate and Unix "grep" command 

In order to validate our RSW algorithm, we used the Xbpl 26 nt 
spliced region as a positive control case. To determine the total 
number of sequencing reads present in the heterozygous samples 
that spanned the 26 nt spliced region, we used a 33 bp DNA 
fragment flanking the novel 26 nt splice and identified reads 
aligning to this region using homology mapping (Exonerate [20]) 
and pattern matching (grep). 

We ran Exonerate [20] version 2.2.0 in default mode (with 
perfect matches) as well as exploring different percent matched 
identification to account for 1-2 possible sequencing errors in the 
reads. Specifically, the command used was: Exonerate -percent 95 
-showalignment FALSE -showvulgar false -ryo '%S\n' FAS- 
TAl.fa FASTA2.fa. In the case of the search for Xbpl spliced 
regions, we used the normal Xbpl region or the spliced Xbpl for 
FASTAl.fa and the FASTA2.fa files, respectively, which con- 
tained the reads in "fasta" format. By default, Exonerate uses a 
local alignment model allowing for small gaps. The 26 gap is too 
large for Exonerate to identify in short sequencing reads; thus we 
ran it twice, once with the unspliced and once with the spliced 
version of Xbpl. 

The Unix regular expression command grep was used to identify 
the exact pattern of the same 33 bp fragment spanning the 26 nt 
Xbpl splice in all of the unmapped reads by searching for the Xbpl 
26 nt flanking sequence (concatenating upstream and downstream 
flanking sequences) with different combinations of read split pair 
lengths. The sum of these split read pair lengths is equal to the 
trimmed read length (33 for Tg treated samples and 77 for Dtt 
treated samples). 

Identification of novel spliced regions for SKBR3 using 
the RSW algorithm 

In order to demonstrate that the RSW pipeline can be applied 
to sequencing data in other biological contexts and from humans, 
we employed RSW on RNA-Seq data from the human breast 
cancer cell line SKBR3 [23]. This study was conducted by the 
Chromosome 17 team of the Chromosome-centric Human 
Proteome Project (C-HPP) of the Human Proteome Organization 
(HUPO) [25-27]. HPP analyses involve integration of proteomics 
data into a genomic framework that will promote a better 
understanding of the relationship of the transcriptome to the 
proteome and of the pathways and biological networks involved in 
the phenotype [28]. Despite its relatively small size, chromosome 
1 7 is rich in protein-coding genes, ranking second in gene density; 
it contains many cancer-associated genes, including BRCA1, 
ERBB2 (Her2/neu), TP53, and genes of the ERBB2 amplicon, 
making it an ideal region to use to test RSW. 

Previously, splice variants expressed in the SKBR3 cell line were 
examined by integrating RNA-Seq and proteomic mass spec- 
trometry data. From total RNA of SKBR3 cells, strand-specific 
RNA-Seq libraries were prepared according to Illumina TruSeq 
standard procedures. Each library was sequenced (101 bases, 
paired end) on 1-3 HiSeq 2000 lanes to obtain an average of 123 



million uniquely mapped reads 4 . The mass spectrometric analyses 
on SKBR3 cell lysate samples were performed with an online 
Dionex nano-LC instrument coupled to a Fourier transfer mass 
spectrometer 4 . The criteria for splice variant identification used 
were the presence of unique reads aligning only to the transcript 
and matching peptide. 

We used the same aligner, Bowtie [24], to align SKBR3 RNA- 
Seq reads against the expanded genome (human reference genome 
hgl9 plus known splice junction boundaries) and used the same 
Bowtie parameter sets to obtain unmapped read sequences. We 
also trimmed the end two bases of each read. Again, we split 
unmapped read into halves, creating a small library of split pair 
reads. The split pairs were generated using a minimum of 33 and a 
maximum 66 nucleotides given that the total read length for the 
SKBR3 cell line RNA-Seq data was 99 nucleotides after trimming. 

As performed for the mouse Ire la. study, we applied the same 
strategy of mapping read pairs to the genome and performed a 
similar SQL query to select mapped read pairs meeting certain 
criteria. Again, we removed false positive pairs (reads that were 
inversely mapped) and filtered out pairs that were adjacent. The 
spliced region and splice length were also defined similarly as our 
mouse analysis. For example, the same criteria (within 5 bp of a 
known junction in the UCSC knownGene reference database) was 
applied. The downstream read clustering and consolidation steps 
were implemented using the same algorithms /scripts. We 
performed separate analyses for the two ends of the paired-end 
RNA-Seq data and also ran RSW on the data set combining both 
ends of the paired-end data. 

To check for the prevalence of splice regions detected by RSW 
in the SKBR3 cell line, we used the Ensemble annotation file 
(Homo_sapiens.GRCh37.74.chrl7.gtf downloaded from http:// 
useast.ensembl.org/info/data/ftp/index.html) to annotate the 
splice junction boundaries of 217 protein isoforms from 108 genes 
on chromosome 17 and then checked exon boundaries against 
candidate splice junction coordinates reported by RSW. To be 
called a match, the two identified splice junction boundaries were 
required to be within 5 bp from each other. The presence of at 
least one matched junction was required to claim an overlapped 
region between the two methods (proteomics and RSW). 

Results 

Mapping statistics 

The Bowtie mapping statistics for Het samples in both 
experiments (Tg and Dtt) are shown in Table 1 . The Dtt-treated 
Het sample contained both more total reads, and more unmapped 
reads when compared to the Tg-treated Het sample. The mapping 
statistics for KO samples were similar (not shown). 

Change in coverage at XBPVs non-canonical intron 

The aligned reads within and spanning the Xbpl 26 nt spliced 
region were visualized in a UCSC custom track (http: //genome. 
ucsc.edu/) for both genotypes, and showed a distinct lack of reads 
mapped onto the 26 nt spliced region in the Het samples for both 
experiments (Figure 3). In contrast, this region was well covered 
throughout by reads in the KO samples. 

Novel spliced region results for XBP1 detected by RSW 

The number of reads from the unmapped set that were 
identified as crossing over the Xbpl 26 nt region by the RSW 
algorithm were 21 and 173 for Tg and Dtt Het samples, 
respectively (Table 2). Actual lists of read halves and associated 
information supporting Xbpl's 26 nt spliced region is reported in 
Table SI. This 26 nt spliced region in Xbpl was the top candidate 
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Table 1. Bowtie Mapping statistics for lrela +/ samples treated with Tg and Dtt. 





Sample Type 


Read Length 


Total # of reads 


Total # of unmapped reads 


Tg treated 


35 


8,922,329 


418,002 (4.7%) 


Dtt treated 


79 


23,442,681 


2,055,557 (8.8%) 



doi:1 0.1 371 /journal.pone.01 00864.t001 
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B) 



Scale 58 bases | 1 rnm9 

Chrll: | 5,424,2561 5,424,2691 5,424,2791 5,424,2881 5,424,2991 5,424,3881 5,424*3181 5,424,3281 5,424,3381 



> flGGGGfiGTGGflGTflflGGCTGGTGGCCGGGTCTGCTGflGTCCGjCflGCflCTCflGflCTflTGTGCflCCTCTGpfiGCfiGGTGCRGGCCCflGTTGTCnCCTCCCCflG 

DTT treated Het 



Xbpl 

Xbpl I 



UCSC Genes (RefSeci, 
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chrll 



S8 bases (- 



: | 5,424,2591 5,424,2691 5,424,2791 5,424,289| 
> flGGGG AGTGGflGTftflGGCTGGTGGCCGGGTCTGCTGflGTCCG 



GenBank, tRNfts & Comparative 



5,424,298| 5,424,388| 5,424 
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51 9| 5,424,3281 S, 424, 3361 
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Xbpl 
Xbpl 
Xbpl 




Figure 3. UCSC custom track view for Xbp1 26 nt spliced region in mouse embryonic fibroblast cells and qRT-PCR results. "Het" = 
fre/a +/ ~; "KO" = irelaT , a = 500 nM Thapsigargin 4 hrs; b = 1 mM Dithiothreitol 4 hrs; For both a and b, Het reads are on top and KO reads are on 
the bottom. Red vertical lines indicate the 26 nt spliced region. 
doi:1 0.1 371 /journal.pone.01 00864.g003 
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Table 2. The number of reads 
/re7a _/_ samples. 


aligned to Xbpl's 


26 bp splice junction by 


our "Read-Split-Walk" (RSW) algorithm for Irelct' and 




Sample type 


Gene 


Splice length 


Splice region 


Supported by # of reads 


Het KO 


Tg treated 


Xbpl 


26 


chrl 1:5424280-542431 2 


21 0 


DTT treated 


Xbpl 


26 


chrl 1:5424280-5424312 


173 0 
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region predicted by RSW for both experiments using the criteria 
described in the methods. For comparison, the next candidate for 
Tg was Fhll, supported by only 4 reads with splice length of 
1797 bp (not identified in Dtt), and the next candidate for Dtt was 
Rps9, supported by 80 reads with a splice length of only 4 bp (not 
identified in Tg). Using primers flanking the Rps9 splice site on 
genomic DNA we confirmed the 4 nt gap is genomic in nature 
rather than being spliced at the mRNA level by IRE la (not 
shown). Since no other candidate region met the criteria in both 
experiments, no other candidate genes were identified showing 
clear non-canonical spliced regions by Irela.. The list of top ranked 
genes identified by RSW from the Tg and Dtt samples and their 
supporting read information is provided in Table S2. 

RSW results using alternative parameters 

We assessed the effect of (1) changing the minimum split length 
from 11 to 13 bp, (2) changing the alignment parameter for 
mapping the split reads to allow 1 mismatch (original analysis 
allowed 0 mismatches), and (3) changing the minimum number of 
reads supporting a spliced region from 2 to 5. Increasing the 
minimum split read length to 13 still resulted in the Xbpl 26 bp 
splice site as the top ranked site, and supported by the same 
number of reads (21 for Tg and 173 for DTT treated samples). 
This is because the minimum split length that mapped to the 
borders of the Xbpl intron and no more than 9 other places was 
14 bp (See Table SI). Allowing 1 mismatch (— v 1 parameter) in 
each split read resulted in zero read pairs selected to support the 
26 bp region. This was due to the fact that many mates for read 
split pairs were filtered into the Bowtie max file having been 
mapped to > 1 0 genomic locations when one mismatch was 
allowed. Requiring at least 5 supporting reads for a spliced region 
reduced the number of reported candidates from 55 to 1 for Tg 
(and from 2533 to 89 for Dtt), with no change in ranking since 
ranks were based on this criteria. While RSW by default reports 
candidates having as low as 2 supporting reads for completeness, 
those candidates supported by few reads should be viewed with 
scepticism. Inclusion of those supported by few reads in the report 
facilitates judging the significance of top hits such as Xbpl. These 
results suggest that the default running parameter settings for 
RSW work well. 

RSW results under the null expectation of no novel 
spliced regions 

To assess the specificity of RWS, we ran our pipeline on a subset 
of the Tg data set for reads that mapped uniquely to the reference 
genome (chrl 9); in this case we do not expect to identify any novel 
spliced regions since the complete reads map to the genome. 
Indeed, we did not observe any novel spliced regions from this 
read set that met all of our criteria. (Several spliced regions known 
by UCSC were identified and filtered.) Extending this to the full 
genome suggests we can expect no more than a minimal number 
of false positives. 



RSW sensitivity compared with alternative approaches 

We compared the RSW algorithm results using the Dtt treated 
Het sample with two different approaches (grep and Exonerate) for 
identifying the number of reads which could be detected spanning 
the 26 bp spliced region in the positive control gene, Xbpl. The 
Venn diagram comparison of reads detected by the three 
approaches is shown in Figure 4. We identified 204 and 141 
reads spanning the 26 bp spliced region in Dtt-treated Het 
samples using Exonerate and the Unix grep command, respectively, 
and no reads in either of the KO samples. In contrast, the RSW 
algorithm identified 173 reads. Among these three methods, 118 
unmapped reads are considered in common. Exonerate missed 
several reads (See Figure 4) that were detected by RSW and vice 
versa, however, it is important to note that neither Exonerate nor 
the Unix grep approach are applicable for mapping genome-wide 
data; they can only be used for predefined candidate spliced 
regions. Conversely, RSW was designed to be utilized on complete 
unaligned genome-wide data sets. 

Comparison of RSW with BWA 

We used the BWA [17] alignment software (backtrack 
algorithm), which allows for gaps in individual reads, to assess 
whether the non-canonical spliced region could be detected in this 
manner. Even with significant modifications to the gap open and 
extend penalties and the length of gap allowed, we could not 
identify any of the reads which spanned the 26 nt gap in Xbpl in 
Tg-treated samples. In Dtt-treated samples, BWA only identified 
67 of the reads (Table S3). Changing the gap open and gap 
extension penalty parameters for BWA, we could not identify any 
more of the reads spanning the splice site. Using bwa-mem and bwa- 
sw (alternatives to bwa backtrack) did not identify any reads for the 
26 bp splice site for either the Dtt or Tg experiments. This is in 
contrast to RSW, which clearly identified the 26 nt spliced region 
as the top hit with a significantly higher number of supporting 
reads in both experiments. 

Comparison of RSW with Bowtie2 

Additionally, we used bowtie2 [18] to determine whether the 
known Xbpl 26 nt spliced region could be identified. By using gap 
extension penalty 1, Bowtie2 recovered most of the reads (166/173 
for gap open penalty 15, and 171/173 for open penalty 5) 
supporting the region of 26 bp in the Dtt sample yet no supporting 
reads were detected in the Tg data set where RSW reported 21 
reads and ranked the Xbpl 26 nt region first. Even for the Dtt 
sample, bowtie2 did not rank the Xbpl 26 nt region on top for any 
parameter combination (at best rank = 62 within chr 1 1 for gap 
open penalty of 5, with substantially lower rank expected for the 
entire genome). 

Comparison of RSW with STAR 

We applied the STAR (v2.U) [19] software with our 
unmapped read set, which was designed to align non-continuous 
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204 reads identified by 
Exonerate program 



141 reads identified by 
Unix "grep" command 




173 reads identified 
by RSW algorithm 



XX 
XX 

XX 
XX 



Reads with one or two end side 
having mismatches with 
reference genome 

Missing reads in which one 
of read halves was kept in 
Bowtie max file or aligned 
to a different region 



•III 



Read halves mapped onto 
known splice junctions 



Reads identified between 
RSW and Unix "grep" 
command 



XX 
XX 



Reads mapped onto 
negative strand NOT 
found by Exonerate 



Reads mapped onto 
negative strand found 
by Exonerate 



Figure 4. Comparison of supporting reads for XbpVs 26 nt splice junction site in the Dithiothreitol (Dtt) sample from RSW, 
Exonerate and the Unix "grepf' command. 
doi:1 0.1 371 /journal.pone.01 00864.g004 



sequences directly to the reference genome. In both Tg and Dtt 
samples, STAR did not identify the 26 bp spliced region for Xbpl 
(See Table 3). 

Running RSW on SKBR3 human Chromosome 17 breast 
cancer paired-end RNA-Seq data set 

The SKBR3 proteomic analyses of chromosome 17 reported 
217 distinct protein isoforms from 108 genes with transcript and 
peptide evidence of novel, alternative splicing; these 108 genes are 
annotated in the Ensembl database as genes with more than one 
protein-coding transcript [23]. All of the 108 genes were identified 
as having novel spliced regions by RSW (Table 4). Among the 217 
Ensembl protein isoform IDs reported by the SKBR3 proteomics 
analysis, 214 (98.6%) of these isoforms were found to share at least 
one common exon junction with our RSW results (Table S4). 
RSW results also indicated that these 108 genes expressed in the 
SKBR3 cell line have a large number of novel spliced regions 
when compared to other genes (data not shown). A total of 261 
novel splice regions from this 108 gene data set were supported by 
two or more reads. A list of novel candidate spliced regions is 
provided in Table S5. 



Discussion 

Technical considerations for the RSW pipeline 

Using Bowtie alignment, the unmapped reads dataset changes 
when different alignment parameters, e.g., -m, -k, and/or -v, are 
used. In our case, after testing several different parameter 
combinations, it appears that the default settings work well for 
the data. However, better combinations may exist, and different 
parameter combinations may be optimal for different species or 
datasets. In our pipeline, we fixed certain parameters, such as the 
minimum split read size after exploring a limited set of reasonable 
values. It is worth considering how these choices affect the 
outcome for different datasets. For example, there is a trade-off 
between the lower/upper bound split size vs alignment sensitivity. 
By allowing a read half to map only up to 10 locations, we are 
specifying that the chance that it maps near the other read split 
half and in the correct configuration is small; if the minimum read 
split size is increased, then fewer read halves will map to >10 
locations. Similarly, there is a trade-off between shorter/longer 
consolidation slip/buffer size. We used 5 bp because the minimum 
read half length used was 1 1 bp for both runs and therefore our 
buffer size is half of the minimum read half length. More or fewer 
supporting reads may be reported if different cut-off criteria are 
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Table 4. Counts of spliced regions detected by RSW and proteomics from the SKBR3 cell line. 





dataset 


Read 1 only 




Read 2 only 




Both Read1&2 


Gene Set 


107 


582 


108 


620 


656 




Both methods 


RSW only 


Both methods 


RSW only 


RSW only 


Novel Isoform 


398 


1022 


435 


1105 


424 


Known Isoform 


868 


2634 


956 


3145 


2692 


Total Isoform 


1266 


3656 


1391 


4250 


3116 



Read 1 refers to the first half of each paired-end read; Read 2 refers to the second half of each paired-end read. Notes: 107 Gene Set (Found by both proteomics and 
RSW methods using the first half of each paired-end read); 582 Gene Set (Found by RSW only using the first half of each paired-end read); 108 Gene Set (Found by both 
proteomics and RSW methods using the second half of each paired-end read); 620 Gene Set (Found by RSW only using the second half of each paired-end read}; Both 
Read1&2 (Common junctions identified by both halves of paired-end data. They are located within the same gene, have the same splice length, and share the same 
splice boundary). 

doi:1 0.1 371 /journal.pone.01 00864.t004 



applied; these can be adjusted to achieve the desired balance 
between sensitivity and specificity in the specific application. 
Several factors will also influence the detection power of RSW. 
These factors include: total read coverage (total number of 
experimental reads); strength of the splicing event(s), i.e. the 
proportion of the target transcripts that are spliced, with spicing 
events that occur at a higher rate being easier to detect; gene 
expression level, with splicing of more highly expressed genes 
being easier to detect; and stringency of filtering parameters used. 

Maximum distance between two split halves 

For mouse MEF data, we selected a 40 kb distance between two 
split read halves to serve as the maximum expected splice length of 
an IRE la target. The distance of 50 kb was used for human breast 
cancer data. However, the question remains as to whether this was 
the best selection criterion for read pairs. We chose 40 kb for 
mouse (50 kb for human) as an arbitrary maximum because the 
average gene length is 47 kb for mouse genome (56 kb for human 
genome) based on the UCSC mm9 refFlat file in mouse and we 
assumed the maximum splicing length for a given gene would be 
less than the average total gene length. Using a shorter maximum 
distance will minimize false positives, while a longer maximum 
distance may increase sensitivity. If the maximum split length is 
increased, users may compensate by increasing the minimum split 
half size and/or decreasing the maximum number of mapping 
places in order to avoid an excess of false positive mappings in that 
region. We also used a very wide range (2 bp, 40 kb) of possible 
splice lengths, since we had data for two Ire la. null cell lines that we 
could intersect to reduce false positives. Users who do not have this 
advantage my well want to reduce the range of acceptable splice 
lengths. Although only recommended for longer reads under 
stringent parameter conditions, it is also possible to run RSW a 
second time without any filter on split length to interrogate 
whether trans-mRNA splicing may occur. 

The comparative performance of the RSW algorithm 

The splitting of reads into multiple read half pairs substantially 
increased the dataset size. Assuming all of the 218 unmapped 
reads supporting the known Xbpl 26 nt splice site are true 
positives, RSW has an estimated 79% (173/218 overlap between 
RSW and both grep and Exonerate) true positive rate (sensitivity). 
The false positive rate of RSW for the Xbpl site is estimated to be 
0%, since all reads identified by RSW appear to be true positives 
at least with regard to their sequencing. When compared to the 
results generated from the Unix grep command, RSW only missed 
reads in which one of the read halves was kept in Bowtie's max file 



(The output file contains all reads with a number of valid 
alignments exceeding the limit set with the -m option) or aligned 
to a different region (7 reads). Exonerate [20] results included 
reads where one or both read ends from a pair mapped onto a 
nearby known splice junction (25 reads), and thus were not part of 
the unmapped reads that RSW was starting from. The remaining 
13 unmapped reads identified using Exonerate are possible false 
positives because those reads have one or two mismatches beyond 
the 33 bp query sequences of the Exonerate input. 

Given the reasons above, we conclude that, by using different 
parameter settings for Bowtie [24] aligner along with RSW, we 
achieved better sensitivity than the other three methods for 
detection of Xbpl's IREla dependent intron. BWA [17] was not 
able to detect the relatively large 26 nt splice as a gap in any of the 
reads. It is worth stressing that both Exonerate and use of the Unix 
grep command require prior knowledge of the flanking sequence to 
check against unmapped reads, whereas our RSW uses the 
unmapped read set to check against the entire reference genome. 
At best, Bowtie2 was only able to detect the 26 nt splice region in 
the Dtt experiment, and even then it was ranked low (62 just 
within chrl 1). Thus, the RSW method performs best for genome 
wide de novo discovery of such short, non-canonical, or otherwise 
previously unknown, spliced regions. Our pipeline also compares 
and selects both samples' information side-by-side. A similar 
analysis using our RSW algorithm can be extended to other 
species, as demonstrated with our analysis of the SKBR3 human 
breast cancer cell line. 

Conclusions 

The positive control for our application, the Xbpl 26 nt non- 
canonical spliced region, was clearly detected in Ire la. heterozy- 
gous samples from treated samples but not in the negative control 
samples, and was our top hit for an Ire la. target splice site with 
both ER stress-inducing treatments. We identified a small number 
of other potential novel splice locations, each with variable size. 
However either the number of reads supporting these splices was 
much lower than observed for Xbpl, the splice length was likely too 
small or large to be a true positive, the results were only found in 
one treatment or the gap was genomic in origin. We have also 
demonstrated the application of RSW on well-characterized 
splicing events (via mass spectrometry) in the SKBR3 human 
cancer cell line. Using this method to detect non-canonical or 
otherwise unknown spliced regions in unmapped mRNA-Seq 
data, we conclude that Xbpl mRNA is likely the only significant 
splicing target of IREla in the MEF transcriptome. However, it is 
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possible that other IRE 1 oc-dependent splice targets escaped 
detection if they form after the peak of Xbpl splicing. Therefore, 
future experiments using longer ER stress treatments of control 
and Ire la knock-out human secretory cells are planned. 

Supporting Information 

Table SI Reads mapped by the RSW pipeline on Xbpl's 
26 nt spliced region in the Tg and Dtt Irelct'~ samples. 

(XLSX) 

Table S2 A list of genes with novel spliced regions 
detected by RSW and their read supporting information. 

(XLSX) 

Table S3 A list of reads supporting Xbpl 26 bp spliced 
region identified by BWA. 

(XLSX) 



Table S4 Overlap between RSW results with exon 
junctions of protein isoform IDs identified by another 
SKBR3 study. 

(XLSX) 

Table S5 A list of novel splice junctions reported by 
RSW algorithm for human breast cancer chromosome 
17 RNA-Seq data. 

(TXT) 
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